Duplicate Detection for Symbolically Compressed Documents
Abstract
A new family of symbolic compression algorithms has recently been developed that includes the ongoing JBIG2 standardization effort as well as related commercial products. These techniques are specifically designed for binary document images. They cluster individual blobs in a document and store the sequence of occurrence of blobs and representative blob templates, hence the name symbolic compression. This paper describes a method for duplicate detection on symbolically compressed document images. It recognizes the text in an image by deciphering the sequence of occurrence of blobs in the compressed representation. We propose a Hidden Markov Model (HMM) method for solving such deciphering problems and suggest applications in multilingual document duplicate detection.
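The link between the stored blob sequence and a substitution cipher can be illustrated with a minimal sketch. It assumes an idealized, noise-free clustering in which every character shape maps to exactly one blob cluster; real clustering is noisy, which is why the paper proposes an HMM. The function names `blob_sequence` and `same_pattern` are hypothetical and are not taken from the paper.

```python
def blob_sequence(text):
    """Return the sequence of cluster ids for the non-space characters of text.

    Each distinct character stands in for one blob template; ids are assigned
    in order of first appearance, as an idealized clustering step might do.
    """
    cluster_ids = {}
    sequence = []
    for ch in text:
        if ch.isspace():
            continue  # white space produces no blob
        if ch not in cluster_ids:
            cluster_ids[ch] = len(cluster_ids)
        sequence.append(cluster_ids[ch])
    return sequence


def same_pattern(text_a, text_b):
    """True when both texts yield the same cluster-id sequence."""
    return blob_sequence(text_a) == blob_sequence(text_b)


# The id sequence hides character identities but keeps the repetition pattern,
# so it behaves like a simple substitution cipher of the underlying text.
print(blob_sequence("hello world"))
print(same_pattern("hello world", "uryyb jbeyq"))  # second string is ROT13 of the first
```

Under these assumptions, two renderings of the same text yield identical id sequences, so comparing sequences is already a crude duplicate test. Recovering the actual characters from the ids is the deciphering problem the abstract refers to.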
Similar resources
Duplicate Detection in Symbolically Compressed Documents
A new family of symbolic compression algorithms, such as the ongoing JBIG2 standardization and commercial products, has recently been developed. These techniques are specifically targeted for binary document images. They cluster individual blobs in a document and store the sequence of occurrence of blobs and representative blob templates, hence the name symbolic compression. This paper describe...
Detecting duplicates among symbolically compressed images in a large document database
The detection of duplicate images is a useful means of indexing a large database of documents. An algorithm for duplicate document detection is proposed in this paper that operates directly on images that have been symbolically compressed using techniques related to the ongoing JBIG2 standardization effort. This paper describes a hidden Markov model (HMM) method that recognizes the text in an im...
Information Extraction from Symbolically Compressed Document Images
The extraction of information from symbolically compressed document images is an increasingly important problem as the related standard (JBIG2) and commercial products become available. Symbolic compression techniques work by clustering individual connected components (blobs) in a document image and storing the sequence of occurrence of blobs and representative blob templates, hence t...
Group 4 Compressed Document Matching
Numerous approaches, including textual, structural and featural, for detecting duplicate documents have been investigated. Since document images are usually stored and transmitted in compressed form, it is advantageous to perform document matching directly on the compressed data. A two-stage process for matching Group 4 compressed document images is presented. In the coarse matching stag...
Substitution Deciphering Based on HMMs with Applications to Compressed Document Processing
It has been shown that simple substitution ciphers can be solved using statistical methods such as probabilistic relaxation. However, the utility of such solutions has been limited by their inability to cope with noise encountered in practical applications. In this paper, we propose a new solution to substitution deciphering based on hidden Markov models. We show that our algorithm is more accu...
Publication date: 1999